Summary: The data appears to be generated by a hyperbolic paraboloid function. My best approximation of the function is \(C = 10 - 0.4A^2B + 2A - 3B\).
Exploration and modeling:
First, I plotted each predictor variable against the target.
library(readr)  # for read_csv
data <- read_csv('./adops_data_regression.csv', col_names = c("A", "B", "C"))
par(mfrow = c(1, 2))
with(data, plot(A, C))
with(data, plot(B, C))
One point seems to be an extreme outlier. I also tried plotting the predictors against each other.
with(data, plot(A, B))
There is no discernible relationship between them. Variable A appears to be right-skewed, while B appears to be roughly uniformly distributed.
par(mfrow = c(1,2))
hist(data$A)
hist(data$B)
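The visual impression from the histograms can be checked numerically. As a sketch (base R has no built-in skewness function, so the moment-based estimate is written by hand, and it is demonstrated here on simulated distributions rather than the data file):

```r
# Moment-based sample skewness: positive => right-skewed, near 0 => symmetric
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

set.seed(1)
skew(rexp(1e4))   # exponential: clearly positive (right-skewed)
skew(runif(1e4))  # uniform: close to zero
```

Applying `skew(data$A)` and `skew(data$B)` would quantify the pattern seen in the histograms.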
I tried fitting a linear model of the form \(\hat{C} = b_0 + b_1 A + b_2 B\).
m1 <- lm(C ~ A + B, data = data)
summary(m1)
##
## Call:
## lm(formula = C ~ A + B, data = data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -9903.2    -5.1    23.0    53.8   333.2
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -19.294     44.427  -0.434    0.664
## A             -8.067      6.016  -1.341    0.181
## B             -1.778     11.427  -0.156    0.876
##
## Residual standard error: 581.4 on 297 degrees of freedom
## Multiple R-squared:  0.006053,  Adjusted R-squared:  -0.00064
## F-statistic: 0.9044 on 2 and 297 DF,  p-value: 0.4059
This model is a poor fit: the p-values are high for every coefficient and for the model as a whole, and the R-squared is essentially zero. I checked diagnostic plots and found that observation 201 is an extreme outlier, as expected.
par(mfrow = c(2,2))
plot(m1)
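The offending row can also be located programmatically rather than read off the diagnostic plots. A minimal sketch, demonstrated on simulated data with an injected outlier (the same call, `which.max(abs(rstandard(m1)))`, applies to the fitted model above):

```r
# Locate the most extreme point via standardized residuals
set.seed(42)
toy <- data.frame(x = 1:100)
toy$y <- toy$x + rnorm(100)
toy$y[60] <- toy$y[60] + 500          # inject a gross outlier at row 60
fit <- lm(y ~ x, data = toy)
which.max(abs(rstandard(fit)))        # -> 60, the injected row
```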
I removed the outlying observation and refit the model.
m2 <- lm(C ~ A + B, data = data[-201,])
summary(m2)
##
## Call:
## lm(formula = C ~ A + B, data = data[-201, ])
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -229.41  -26.36    1.93   32.99  166.43
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  25.7320     4.2640   6.035 4.75e-09 ***
## A            -1.3667     0.5776  -2.366   0.0186 *
## B           -15.2219     1.0974 -13.871  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 55.71 on 296 degrees of freedom
## Multiple R-squared:  0.394,  Adjusted R-squared:  0.3899
## F-statistic: 96.23 on 2 and 296 DF,  p-value: < 2.2e-16
This is a better fit, but the diagnostic plots reveal that the residuals are neither normally distributed nor independent.
par(mfrow = c(2,2))
plot(m2)
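The normality concern can also be quantified with a Shapiro-Wilk test. A sketch, demonstrated on simulated skewed "residuals" since the real ones depend on the data file; the same call, `shapiro.test(residuals(m2))`, applies to the model above:

```r
# Shapiro-Wilk test: a tiny p-value means we reject normality
set.seed(3)
res <- rexp(100) - 1                  # skewed, mean-zero residuals
shapiro.test(res)$p.value             # very small => clearly non-normal
```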
To dig deeper, I created a 3D scatter plot of the three variables.
library(plotly)
plot_ly(data[-201,], x = ~A, y = ~B, z = ~C, marker=list(color='blue', size=2)) %>%
add_markers()
The target variable seems to be generated by a hyperbolic paraboloid function. After some trial and error, I fit another model of the form \(\hat{C} = b_0 + b_1 A^2 B + b_2 A + b_3 B\).
m3 <- lm(C ~ I(A^2*B) + A + B, data = data[-201,])
summary(m3)
##
## Call:
## lm(formula = C ~ I(A^2 * B) + A + B, data = data[-201, ])
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.44389 -0.02826 -0.00464  0.02317  1.51817
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.0078000  0.0264079   379.0   <2e-16 ***
## I(A^2 * B)  -0.4002566  0.0001409 -2841.0   <2e-16 ***
## A            2.0018307  0.0036936   542.0   <2e-16 ***
## B           -2.9998073  0.0079164  -378.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3374 on 295 degrees of freedom
## Multiple R-squared:  1,  Adjusted R-squared:  1
## F-statistic: 4.44e+06 on 3 and 295 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(m3)
While the issues with the normality and independence of the residuals have not been fully eliminated, the residuals themselves are now very small (residual standard error of 0.34, down from 55.71 for the linear model). Time permitting, it may be possible to refine the model to reduce them further.
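As a sanity check on the proposed closed form, the recovery can be verified on simulated data: if \(C = 10 - 0.4A^2B + 2A - 3B\) plus small noise, the same model specification should reproduce those coefficients. A sketch, with distributional and noise assumptions of my own:

```r
# Simulate from the conjectured generating function and refit
set.seed(7)
A <- runif(300, -5, 5)
B <- runif(300, -5, 5)
C <- 10 - 0.4 * A^2 * B + 2 * A - 3 * B + rnorm(300, sd = 0.3)
fit <- lm(C ~ I(A^2 * B) + A + B)
round(coef(fit), 2)                   # close to 10, -0.4, 2, -3
```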